Starbucks Capstone Challenge¶

Introduction¶

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

Example¶

To give an example, a user could receive a "buy 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
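The attribution rule described above can be expressed as a small helper: a completion only counts as influenced if the customer viewed the offer after receiving it and before completing it, inside the validity window. This is a minimal sketch with hypothetical timestamps; the name `is_influenced` is not from the data set.

```python
def is_influenced(received_t, viewed_t, completed_t, duration_days):
    """Return True only if the offer was viewed between receipt and
    completion, and the completion fell inside the validity window.
    Times are in hours, matching the transcript's time column."""
    expires_t = received_t + duration_days * 24
    if viewed_t is None:
        return False  # completed without ever seeing the offer
    return received_t <= viewed_t <= completed_t <= expires_t

# Viewed on day 2, completed on day 6 of a 10-day offer: influenced.
print(is_influenced(0, 48, 144, 10))    # True
# Never opened the offer, completion record still exists: not influenced.
print(is_influenced(0, None, 144, 10))  # False
```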

Cleaning¶

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
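One way to estimate that no-offer baseline, sketched on toy data with hypothetical column names (`group`, `got_offer`, `amount`), is to average spend over the customers who received no offers during the test:

```python
import pandas as pd

# Hypothetical toy data: per-customer total spend plus whether they got any offer.
spend = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "group":    ["F_30s", "F_30s", "M_40s", "M_40s"],
    "got_offer": [True, False, True, False],
    "amount":   [22.0, 18.0, 9.0, 12.0],
})

# Baseline: average spend among customers who never received an offer.
baseline = spend[~spend.got_offer].groupby("group")["amount"].mean()
print(baseline)
```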

Final Advice¶

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
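The heuristic in parentheses boils down to comparing per-group response rates and sending each group its best-performing offer; a minimal sketch on hypothetical toy records:

```python
import pandas as pd

# Hypothetical records: one row per offer sent, with whether the customer responded.
sent = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "offer":  ["A", "A", "B", "B", "A", "B"],
    "responded": [1, 1, 1, 0, 0, 1],
})

# Response rate per (demographic, offer); then pick each group's best offer.
rates = sent.groupby(["gender", "offer"])["responded"].mean()
best = rates.unstack().idxmax(axis=1)
print(best)
```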

Data Sets¶

The data is contained in three files:

  • portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json - demographic data for each customer
  • transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) - offer id
  • offer_type (string) - type of offer ie BOGO, discount, informational
  • difficulty (int) - minimum required spend to complete an offer
  • reward (int) - reward given for completing an offer
  • duration (int) - time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) - age of the customer
  • became_member_on (int) - date when customer created an app account
  • gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
  • id (str) - customer id
  • income (float) - customer's income

transcript.json

  • event (str) - record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) - customer id
  • time (int) - time in hours since start of test. The data begins at time t=0
  • value - (dict of strings) - either an offer id or transaction amount depending on the record

Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.

After opening the terminal and running the command above, return to the notebook (via the Jupyter icon) and you should be able to run the cells below without any errors.

Problem Introduction¶

The purpose of this project is to study consumer behavior in the Starbucks app and use it to develop product recommendations. By collecting basic consumer demographics and shopping behavior, including responses to the various offers, we can make offers more relevant to consumers.

Strategy to solve the problem¶

An effective analysis needs clear goals, an overview of the data to inform modeling, and a summary that can be extended later. I have divided the work as follows:

  • Metrics

    • Identify the evaluation metrics.
  • Exploratory Data Analysis (EDA)

    • Portfolio: verify each column and split out the channel information, to see which channels reach consumers most often.
    • Profile: check age, gender and year of membership, to understand the consumers: Which age groups dominate? Which attributes matter for a purchase? Does gender affect product consumption?
    • Transcript: check the events around accepting offers, to see the success rate of submitted offers and how engagement compares between regular and member customers.
  • Modeling

    • Build a model to predict whether the offer is "BOGO" or "DISCOUNT", comparing LinearRegression with LogisticRegression, with accuracy and error as the metrics.
  • Results
  • Conclusion
  • Improvement

Metrics¶

In this project, the emphasis is on "accuracy" and "error", which show how well the model performs. We compare algorithms from the scikit-learn library to determine which suits this data set.
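As an illustrative sketch of that kind of comparison (synthetic features and labels only; the real inputs come from the merged Starbucks data sets): fit a scikit-learn classifier, then report accuracy and its complement, the error rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the engineered features and a binary offer label.
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Accuracy = share of correct predictions; error = its complement.
accuracy = accuracy_score(y_test, model.predict(X_test))
error = 1 - accuracy
print(f"accuracy={accuracy:.2f}, error={error:.2f}")
```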

Exploratory Data Analysis (EDA)¶

In [1]:
# Import libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px 
import math
import json
import datetime
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import plot_confusion_matrix
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [2]:
# load in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

a. Overview of the data.¶

a-1. Portfolio details.¶

Overview of portfolio: the distinct values in each column.

  • reward: 10, 5, 3, 2, 0
  • channels: web, email, mobile, social
  • difficulty: 20, 10, 7, 5, 0
  • duration: 10, 7, 5, 4, 3
  • offer_type: bogo, discount, informational
  • id: 10 distinct offer ids.
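The distinct values in the bullets above can be read off the frame with `unique()`/`nunique()`; a toy illustration (not the real portfolio frame):

```python
import pandas as pd

# Toy frame with the same columns, to show the calls behind the overview bullets.
toy = pd.DataFrame({
    "reward": [10, 10, 0, 5],
    "offer_type": ["bogo", "bogo", "informational", "discount"],
    "id": ["a", "b", "c", "d"],
})

rewards = sorted(toy.reward.unique())            # distinct reward values
offer_types = sorted(toy.offer_type.unique())    # distinct offer types
n_offers = toy.id.nunique()                      # number of distinct offer ids
print(rewards, offer_types, n_offers)
```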
In [3]:
#Check the basics of the data.
portfolio
Out[3]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
5 3 [web, email, mobile, social] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2
6 2 [web, email, mobile, social] 10 10 discount fafdcd668e3743c1bb461111dcafc2a4
7 0 [email, mobile, social] 0 3 informational 5a8bc65990b245e5a138643cd4eb9837
8 5 [web, email, mobile, social] 5 5 bogo f19421c1d4aa40978ebb69ca19b0e20d
9 2 [web, email, mobile] 10 7 discount 2906b810c7d4411798c6938adc9daaa5
In [4]:
portfolio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
In [5]:
portfolio.shape
Out[5]:
(10, 6)
In [6]:
portfolio.columns
Out[6]:
Index(['reward', 'channels', 'difficulty', 'duration', 'offer_type', 'id'], dtype='object')

a-2. Profile details.¶

Overview of profile: customer details. There are 17,000 records consisting of gender, age, id, became_member_on and income.

  • Gender: F (female), M (male), None (unidentified).
  • Age: range 18 - 118 years.
  • became_member_on: range 2013 - 2018.
  • Income: range 30000.0 - 120000.0
In [7]:
#Check the basics of the data.
profile.head()
Out[7]:
gender age id became_member_on income
0 None 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 None 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 None 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN
In [8]:
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
In [9]:
profile.shape
Out[9]:
(17000, 5)
In [10]:
profile.columns
Out[10]:
Index(['gender', 'age', 'id', 'became_member_on', 'income'], dtype='object')
In [11]:
profile.age.min(), profile.age.max()
Out[11]:
(18, 118)
In [12]:
profile.became_member_on.min(), profile.became_member_on.max()
Out[12]:
(20130729, 20180726)
In [13]:
profile.income.min(), profile.income.max()
Out[13]:
(30000.0, 120000.0)
In [14]:
profile.index
Out[14]:
RangeIndex(start=0, stop=17000, step=1)
In [15]:
#Replace unidentified gender (None) with "O".
profile['gender'] = profile['gender'].fillna('O')
In [16]:
profile.head()
Out[16]:
gender age id became_member_on income
0 O 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 O 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 O 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN

a-3. Transcript details.¶

Overview of transcript: event records. There are 306,534 records consisting of person, event, value and time.

  • Event: offer received, offer viewed, offer completed and transaction.
  • Value: an offer id or a transaction amount.
  • Time: range 0 - 714 hours.
In [17]:
#Check the basics of the data.
transcript.head()
Out[17]:
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0
In [18]:
transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
In [19]:
transcript.shape
Out[19]:
(306534, 4)
In [20]:
transcript.columns
Out[20]:
Index(['person', 'event', 'value', 'time'], dtype='object')
In [21]:
transcript.time.min(), transcript.time.max()
Out[21]:
(0, 714)

Amount of data¶

In [22]:
#Amount of portfolio, profile and transcript.
print("Amount of portfolio: ", portfolio.shape[0])
print("Amount of profile: ",profile.shape[0])
print("Amount of transcript: ",transcript.shape[0])
Amount of portfolio:  10
Amount of profile:  17000
Amount of transcript:  306534

b. Data Exploration.¶

b-1. Data Exploration of Portfolio.¶

  1. Find the frequency of the data.
  2. Create columns "web", "email", "mobile" and "social" from the channels lists.
  3. Create a "sum" column to store the channel totals.
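The channel indicator columns of step 2 can also be built without looping over channel names, e.g. with `explode` plus `crosstab`; a sketch on a toy frame shaped like `portfolio.channels`:

```python
import pandas as pd

# Toy channels column shaped like portfolio.channels (lists of channel names).
toy = pd.DataFrame({
    "id": ["a", "b"],
    "channels": [["email", "mobile"], ["web", "email"]],
})

# Explode the lists to one row per (id, channel), then cross-tabulate
# id x channel into 0/1 indicator columns.
exploded = toy.explode("channels")
dummies = pd.crosstab(exploded["id"], exploded["channels"])
print(dummies)
```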
In [23]:
#Assign portfolio to df_portfolio.
df_portfolio = portfolio
df_portfolio
Out[23]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
5 3 [web, email, mobile, social] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2
6 2 [web, email, mobile, social] 10 10 discount fafdcd668e3743c1bb461111dcafc2a4
7 0 [email, mobile, social] 0 3 informational 5a8bc65990b245e5a138643cd4eb9837
8 5 [web, email, mobile, social] 5 5 bogo f19421c1d4aa40978ebb69ca19b0e20d
9 2 [web, email, mobile] 10 7 discount 2906b810c7d4411798c6938adc9daaa5
In [24]:
#Frequency of reward.
df_portfolio.reward.value_counts()
Out[24]:
5     3
0     2
2     2
10    2
3     1
Name: reward, dtype: int64
In [25]:
#Frequency of difficulty.
df_portfolio.difficulty.value_counts()
Out[25]:
10    4
0     2
5     2
20    1
7     1
Name: difficulty, dtype: int64
In [26]:
#Frequency of duration.
df_portfolio.duration.value_counts()
Out[26]:
7     4
5     2
10    2
3     1
4     1
Name: duration, dtype: int64
In [27]:
#Frequency of offer_type.
df_portfolio.offer_type.value_counts()
Out[27]:
discount         4
bogo             4
informational    2
Name: offer_type, dtype: int64
In [28]:
#Add one indicator column per channel.
channels_val = set()
for val in df_portfolio.channels:
    channels_val.update(set(val))

for val in channels_val:
    df_portfolio[val] = df_portfolio.channels.apply(lambda x: 1 if val in x else 0)
    
df_portfolio
Out[28]:
reward channels difficulty duration offer_type id email web social mobile
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0 1 1 1 1
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed 1 1 0 1
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 1 0 1
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 1 0 0
5 3 [web, email, mobile, social] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2 1 1 1 1
6 2 [web, email, mobile, social] 10 10 discount fafdcd668e3743c1bb461111dcafc2a4 1 1 1 1
7 0 [email, mobile, social] 0 3 informational 5a8bc65990b245e5a138643cd4eb9837 1 0 1 1
8 5 [web, email, mobile, social] 5 5 bogo f19421c1d4aa40978ebb69ca19b0e20d 1 1 1 1
9 2 [web, email, mobile] 10 7 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
In [29]:
#Counts of channels by email, social, web, mobile.
print("Frequency of channels from email: ", df_portfolio.email.sum())
print("Frequency of channels from social: ", df_portfolio.social.sum())
print("Frequency of channels from web: ", df_portfolio.web.sum())
print("Frequency of channels from mobile: ", df_portfolio.mobile.sum())
Frequency of channels from email:  10
Frequency of channels from social:  6
Frequency of channels from web:  8
Frequency of channels from mobile:  9
In [30]:
#Channel usage per offer type, with a per-row total.
offer_type_by_chans = df_portfolio[['offer_type', 'email', 'social', 'web', 'mobile']].copy()
offer_type_by_chans['sum'] = df_portfolio['email'] + df_portfolio['social'] + df_portfolio['web'] + df_portfolio['mobile']
offer_type_by_chans
Out[30]:
offer_type email social web mobile sum
0 bogo 1 1 0 1 3
1 bogo 1 1 1 1 4
2 informational 1 0 1 1 3
3 bogo 1 0 1 1 3
4 discount 1 0 1 0 2
5 discount 1 1 1 1 4
6 discount 1 1 1 1 4
7 informational 1 1 0 1 3
8 bogo 1 1 1 1 4
9 discount 1 0 1 1 3
In [31]:
#Summary statistics for each channel.
offer_type_by_chans.iloc[:, :5].describe()
Out[31]:
email social web mobile
count 10.0 10.000000 10.000000 10.000000
mean 1.0 0.600000 0.800000 0.900000
std 0.0 0.516398 0.421637 0.316228
min 1.0 0.000000 0.000000 0.000000
25% 1.0 0.000000 1.000000 1.000000
50% 1.0 1.000000 1.000000 1.000000
75% 1.0 1.000000 1.000000 1.000000
max 1.0 1.000000 1.000000 1.000000
In [32]:
#Sort data.
offer_type_by_chans.sort_values(by=['offer_type'])
Out[32]:
offer_type email social web mobile sum
0 bogo 1 1 0 1 3
1 bogo 1 1 1 1 4
3 bogo 1 0 1 1 3
8 bogo 1 1 1 1 4
4 discount 1 0 1 0 2
5 discount 1 1 1 1 4
6 discount 1 1 1 1 4
9 discount 1 0 1 1 3
2 informational 1 0 1 1 3
7 informational 1 1 0 1 3
In [33]:
#Conclusions.
offer_type_by_chans.groupby(['offer_type']).sum()
Out[33]:
email social web mobile sum
offer_type
bogo 4 3 3 4 14
discount 4 2 4 3 13
informational 2 1 1 2 6

According to the portfolio data,

  • email is the channel used by every offer, so it has the strongest association with offer_type.
  • BOGO offers are distributed through the most channels overall (sum of 14 vs 13 for discount).

Data Visualization of Portfolio¶

  1. How are reward, difficulty and duration related?
In [34]:
#Relation of reward, difficulty and duration.
plt.figure(figsize=(10,5))
(df_portfolio.duration/df_portfolio.shape[0]).plot(kind='line', marker='o', legend='duration');
(df_portfolio.difficulty/df_portfolio.shape[0]).plot(kind='line', marker='o', legend='difficulty');
(df_portfolio.reward/df_portfolio.shape[0]).plot(kind='line', marker='o', legend='reward');
plt.ylabel('Value / number of offers', fontsize = 12)
plt.xlabel('Index', fontsize = 12)
plt.title('Relation of reward, difficulty and duration.');
  2. How popular is each communication channel?
In [35]:
#The most impactful communication channel.
(offer_type_by_chans.groupby(['offer_type']).sum().iloc[:, :4]).plot(kind='box', figsize=(10,5));
plt.ylabel('Frequency of channels.', fontsize = 12);
plt.xlabel('Channels.', fontsize = 12);
plt.title('Relation of channels and frequency.', fontsize = 15);

As shown above, email is the most used communication channel and social is the least used.

b-2. Data Exploration of Profile.¶

  1. Find the frequency of the data.
  2. Separate "day", "month", "year" from the column "became_member_on" and drop "became_member_on".
  3. Delete unidentified records: those with gender None, which correspond to people listed as 118 years old with undetermined income.
  4. Create a "val" column to use for calculating totals.
  5. Create a "range_of_age" column to hold the age range.
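For step 2, the yyyymmdd integers can equivalently be parsed once with `pd.to_datetime` and read back through the `.dt` accessor; a sketch on toy values:

```python
import pandas as pd

# Toy yyyymmdd membership dates shaped like became_member_on.
toy = pd.DataFrame({"became_member_on": [20170715, 20180426]})

# Parse the integers once, then take each component off the .dt accessor.
dates = pd.to_datetime(toy.became_member_on.astype(str), format="%Y%m%d")
toy["year"] = dates.dt.year
toy["month"] = dates.dt.month
toy["day"] = dates.dt.day
print(toy)
```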
In [36]:
#Assign profile to df_profile.
df_profile = profile
df_profile.head()
Out[36]:
gender age id became_member_on income
0 O 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 O 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 O 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN
In [37]:
df_profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            17000 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
In [38]:
df_profile['Date'] = df_profile.became_member_on.apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d'))
df_profile['year'] = df_profile.became_member_on.astype(str).apply(lambda x: int((x)[:4]))
df_profile['month'] = df_profile.became_member_on.astype(str).apply(lambda x: int((x)[4:6]))
df_profile['days'] = df_profile.became_member_on.astype(str).apply(lambda x: int((x)[6:]))
df_profile.head()
Out[38]:
gender age id became_member_on income Date year month days
0 O 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN 2017-02-12 2017 2 12
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0 2017-07-15 2017 7 15
2 O 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN 2018-07-12 2018 7 12
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0 2017-05-09 2017 5 9
4 O 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN 2017-08-04 2017 8 4
In [39]:
#Counts of gender.
df_profile.gender.value_counts()
Out[39]:
M    8484
F    6129
O    2387
Name: gender, dtype: int64
In [40]:
#Counts of age.
df_profile.age.value_counts()
Out[40]:
118    2175
58      408
53      372
51      363
54      359
       ... 
100      12
96        8
98        5
99        5
101       5
Name: age, Length: 85, dtype: int64
In [41]:
#Values of income.
df_profile.income.value_counts()
Out[41]:
73000.0     314
72000.0     297
71000.0     294
57000.0     288
74000.0     282
           ... 
116000.0     46
112000.0     45
107000.0     45
117000.0     32
120000.0     13
Name: income, Length: 91, dtype: int64
In [42]:
#Counts of year.
df_profile.year.value_counts()
Out[42]:
2017    6469
2018    4198
2016    3526
2015    1830
2014     691
2013     286
Name: year, dtype: int64
In [43]:
#Counts of month.
df_profile.month.value_counts().sort_index(ascending=True)
Out[43]:
1     1525
2     1202
3     1329
4     1315
5     1307
6     1265
7     1359
8     1610
9     1515
10    1568
11    1449
12    1556
Name: month, dtype: int64
In [44]:
# Select gender is male and female.
df_profile[(df_profile['gender'] == 'F') | (df_profile['gender'] == 'M')].head()
Out[44]:
gender age id became_member_on income Date year month days
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0 2017-07-15 2017 7 15
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0 2017-05-09 2017 5 9
5 M 68 e2127556f4f64592b11af22de27a7932 20180426 70000.0 2018-04-26 2018 4 26
8 M 65 389bc3fa690240e798340f5a15918d5c 20180209 53000.0 2018-02-09 2018 2 9
12 M 58 2eeac8d8feae4a8cad5a6af0499a211d 20171111 51000.0 2017-11-11 2017 11 11
In [45]:
#Check for unspecified values.
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')].head()
Out[45]:
gender age id became_member_on income Date year month days
0 O 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN 2017-02-12 2017 2 12
2 O 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN 2018-07-12 2018 7 12
4 O 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN 2017-08-04 2017 8 4
6 O 118 8ec6ce2a7e7949b1bf142def7d0e0586 20170925 NaN 2017-09-25 2017 9 25
7 O 118 68617ca6246f4fbc85e91a2a49552598 20171002 NaN 2017-10-02 2017 10 2

According to the result above, these records have age 118 and null income.

In [46]:
#Number of customers whose gender could not be determined.
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')]['gender'].value_counts()[0]
Out[46]:
2387
In [47]:
#Number of distinct ages among customers with undetermined gender.
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')]['age'].value_counts().count()
Out[47]:
67

Missing income data.

  • Is all of the missing income data confined to customers whose gender could not be determined?
In [48]:
#Count of customers with missing income.
df_profile['income'].isnull().value_counts()[1]
Out[48]:
2175
In [49]:
df_profile[(df_profile['gender'] != 'F') & (df_profile['gender'] != 'M')]['income'].isnull().value_counts()
Out[49]:
True     2175
False     212
Name: income, dtype: int64
In [50]:
df_profile[(df_profile['gender'] == 'F') | (df_profile['gender'] == 'M')]['income'].isnull().value_counts()
Out[50]:
False    14613
Name: income, dtype: int64

According to the above results, all missing income data belongs to customers whose gender could not be determined. Therefore, we remove the records where gender is 'O' (can't identify).

In [51]:
#Clean the data by keeping only gender F or M.
df_profile = df_profile[(df_profile['gender'] == 'F') | (df_profile['gender'] == 'M')]
df_profile.head()
Out[51]:
gender age id became_member_on income Date year month days
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0 2017-07-15 2017 7 15
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0 2017-05-09 2017 5 9
5 M 68 e2127556f4f64592b11af22de27a7932 20180426 70000.0 2018-04-26 2018 4 26
8 M 65 389bc3fa690240e798340f5a15918d5c 20180209 53000.0 2018-02-09 2018 2 9
12 M 58 2eeac8d8feae4a8cad5a6af0499a211d 20171111 51000.0 2017-11-11 2017 11 11
In [52]:
#Sort by membership date and rename the Date column back to became_member_on.
df_profile = df_profile.sort_values(by=['Date'], ascending=False)
df_profile = df_profile.drop(labels='became_member_on', axis=1)
df_profile = df_profile.rename(columns={'Date': 'became_member_on'})
df_profile["val"] = 1
In [53]:
#Check the data.
df_profile.head()
Out[53]:
gender age id income became_member_on year month days val
10282 M 46 e4989ca3d8974e28a3fe87aa0b0ce327 74000.0 2018-07-26 2018 7 26 1
14757 M 59 3d50b3815c8e4f3b84f46be4d1b64cd3 34000.0 2018-07-26 2018 7 26 1
12902 M 85 f1f01be4e2344bf08b7f488c115ad374 45000.0 2018-07-26 2018 7 26 1
4357 M 76 6597e5ab63754726b83ba2032f008d26 92000.0 2018-07-26 2018 7 26 1
13249 F 45 42f00b6b150e45149c50db9bd642bca6 70000.0 2018-07-26 2018 7 26 1
In [54]:
#Group by year.
df_profile[['age', 'year', 'income','val']].groupby(['year']).sum()
Out[54]:
age income val
year
2013 14207 15597000.0 271
2014 33754 38296000.0 658
2015 85799 105177000.0 1568
2016 169872 211236000.0 2992
2017 298867 362194000.0 5510
2018 192352 223712000.0 3614
In [55]:
#age and income by gender and year.
df_profile[['gender','age', 'year', 'income', 'val']].groupby(['gender','year']).sum()
Out[55]:
age income val
gender year
F 2013 4202 4567000.0 74
2014 7343 8468000.0 138
2015 38083 47701000.0 663
2016 88385 112994000.0 1509
2017 140503 174463000.0 2443
2018 74177 88844000.0 1302
M 2013 10005 11030000.0 197
2014 26411 29828000.0 520
2015 47716 57476000.0 905
2016 81487 98242000.0 1483
2017 158364 187731000.0 3067
2018 118175 134868000.0 2312
In [56]:
#Data of age.
df_profile.age
Out[56]:
10282    46
14757    59
12902    85
4357     76
13249    45
         ..
10933    67
1200     52
7796     58
569      36
8948     64
Name: age, Length: 14613, dtype: int64
In [57]:
#keep the age range.
range_age = []
for val in df_profile.age:
    if val >= 100:
        cat = 10
    elif val >= 90:
        cat = 9
    elif val >= 80:
        cat = 8
    elif val >= 70:
        cat = 7
    elif val >= 60:
        cat = 6
    elif val >= 50:
        cat = 5
    elif val >= 40:
        cat = 4
    elif val >= 30:
        cat = 3
    elif val >= 20:
        cat = 2
    elif val >= 10:
        cat = 1
    elif val >= 0:
        cat = 0
    
    range_age.append(cat)
In [58]:
len(range_age)
Out[58]:
14613
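The binning loop above amounts to flooring each age into its decade and capping the bin at 10; a vectorized sketch of the same rule:

```python
import pandas as pd

# Toy ages, including one in the placeholder 118 group.
ages = pd.Series([18, 45, 101, 118])

# Decade bin index, capped at 10 so every age >= 100 falls in the same bin.
bins = (ages // 10).clip(upper=10)
print(bins.tolist())
```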
In [59]:
#Add the range_of_age column to df_profile.
df_profile['range_of_age'] = range_age
In [60]:
#Labels for each age-range bin.
age_range = {0:'Age: 0-9',
             1:'Age: 10-19',
             2:'Age: 20-29',
             3:'Age: 30-39',
             4:'Age: 40-49',
             5:'Age: 50-59',
             6:'Age: 60-69',
             7:'Age: 70-79',
             8:'Age: 80-89',
             9:'Age: 90-99',
             10:'Age: >=100'}
In [61]:
#Check the data.
df_profile.head()
Out[61]:
gender age id income became_member_on year month days val range_of_age
10282 M 46 e4989ca3d8974e28a3fe87aa0b0ce327 74000.0 2018-07-26 2018 7 26 1 4
14757 M 59 3d50b3815c8e4f3b84f46be4d1b64cd3 34000.0 2018-07-26 2018 7 26 1 5
12902 M 85 f1f01be4e2344bf08b7f488c115ad374 45000.0 2018-07-26 2018 7 26 1 8
4357 M 76 6597e5ab63754726b83ba2032f008d26 92000.0 2018-07-26 2018 7 26 1 7
13249 F 45 42f00b6b150e45149c50db9bd642bca6 70000.0 2018-07-26 2018 7 26 1 4
In [62]:
df_profile.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14613 entries, 10282 to 8948
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   gender            14613 non-null  object        
 1   age               14613 non-null  int64         
 2   id                14613 non-null  object        
 3   income            14613 non-null  float64       
 4   became_member_on  14613 non-null  datetime64[ns]
 5   year              14613 non-null  int64         
 6   month             14613 non-null  int64         
 7   days              14613 non-null  int64         
 8   val               14613 non-null  int64         
 9   range_of_age      14613 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(6), object(2)
memory usage: 1.2+ MB
In [63]:
#Consumer income by age and gender.
df_profile[['range_of_age','gender', 'income', 'val']].groupby(['range_of_age','gender']).sum().sort_index()
Out[63]:
income val
range_of_age gender
1 F 2388000.0 45
M 7962000.0 160
2 F 20372000.0 393
M 48819000.0 960
3 F 27136000.0 495
M 53498000.0 1008
4 F 54590000.0 835
M 85121000.0 1434
5 F 119579000.0 1560
M 127330000.0 1925
6 F 102946000.0 1350
M 103758000.0 1602
7 F 63568000.0 832
M 61054000.0 925
8 F 34019000.0 450
M 25181000.0 371
9 F 11413000.0 155
M 6291000.0 97
10 F 1026000.0 14
M 161000.0 2
In [64]:
#Calculation
df_profile[['range_of_age','gender', 'income']].groupby(['gender','range_of_age']).describe()
Out[64]:
income
count mean std min 25% 50% 75% max
gender range_of_age
F 1 45.0 53066.666667 12393.913462 31000.0 42000.0 55000.0 62000.0 73000.0
2 393.0 51837.150127 12750.107803 30000.0 41000.0 52000.0 62000.0 75000.0
3 495.0 54820.202020 15663.141575 30000.0 42000.0 53000.0 66000.0 100000.0
4 835.0 65377.245509 19016.658344 30000.0 53000.0 64000.0 78000.0 119000.0
5 1560.0 76653.205128 21942.156507 30000.0 59000.0 77000.0 93000.0 120000.0
6 1350.0 76256.296296 21957.694371 30000.0 60000.0 76000.0 93000.0 120000.0
7 832.0 76403.846154 22446.682588 30000.0 60000.0 77000.0 93000.0 119000.0
8 450.0 75597.777778 22548.349328 31000.0 60000.0 75000.0 92000.0 120000.0
9 155.0 73632.258065 21707.326956 31000.0 57000.0 74000.0 89000.0 119000.0
10 14.0 73285.714286 24408.159328 42000.0 56750.0 70500.0 92500.0 118000.0
M 1 160.0 49762.500000 12653.330036 30000.0 39000.0 49000.0 59000.0 74000.0
2 960.0 50853.125000 13194.079099 30000.0 39000.0 50000.0 62000.0 75000.0
3 1008.0 53073.412698 14791.989279 30000.0 41000.0 52000.0 64000.0 100000.0
4 1434.0 59359.135286 17904.609716 30000.0 45000.0 58000.0 71000.0 119000.0
5 1925.0 66145.454545 21157.749163 30000.0 50000.0 65000.0 81000.0 120000.0
6 1602.0 64767.790262 21398.973426 30000.0 49000.0 63000.0 79000.0 120000.0
7 925.0 66004.324324 21666.549684 30000.0 50000.0 64000.0 81000.0 119000.0
8 371.0 67873.315364 21250.694477 30000.0 52000.0 67000.0 81500.0 120000.0
9 97.0 64855.670103 21965.877748 30000.0 51000.0 60000.0 81000.0 120000.0
10 2.0 80500.000000 24748.737342 63000.0 71750.0 80500.0 89250.0 98000.0

Data Visualization of Profile.¶

  1. Popularity of Starbucks use each year.
In [65]:
#Popularity of Starbucks use each year.
fig = px.histogram(data_frame=df_profile, x="year", y='income', color="year", pattern_shape="gender")
fig.update_layout(title_text='Relation of year and income.', bargap=0.1)
fig.show()
  1. Range of age and income.
In [66]:
#Range of age and income.
fig = px.histogram(data_frame=df_profile, x="range_of_age", y='income')
fig.update_layout(title_text='Relation of age range and income.', bargap=0.1)
fig.show()
  1. Relation of year, age and gender.
In [67]:
#Relation of year, age and gender.
fig = px.strip(df_profile, x="age", y="year", orientation="h", color="gender")
fig.show()
  1. Relation of range_of_age, year, income and gender.
In [68]:
#Relation of range_of_age, year, income and gender.
fig = px.scatter_matrix(df_profile, dimensions=["range_of_age", "year", "income"], color="gender")
fig.show()

b-3. Data Exploration of Transcript.¶

  1. Find the frequency of the data.
  2. Extract "value_keys" and "values" from the dictionary column "value", then drop "value". The values hold offer ids (or transaction amounts).
  3. Create transcript_event by filtering rows where event is "transaction".
  4. Create transcript_value_keys by filtering rows where value_keys is "amount".
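The dictionary-splitting in step 2 can be sketched on a toy frame (the rows here are hypothetical, using the same schema as transcript):

```python
import pandas as pd

# Toy rows with the same schema as transcript (hypothetical data).
df = pd.DataFrame({
    "event": ["offer received", "transaction"],
    "value": [{"offer id": "9b98b8c7"}, {"amount": 1.59}],
})

# Step 2: split each one-entry dict into a key column and a value column,
# then drop the original "value" column.
df["value_keys"] = df["value"].apply(lambda d: list(d.keys())[0])
df["values"] = df["value"].apply(lambda d: list(d.values())[0])
df = df.drop(columns="value")
print(df)
```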
In [69]:
#Copy transcript to df_transcript (use .copy() so the original frame is not modified).
df_transcript = transcript.copy()
df_transcript.head()
Out[69]:
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0
In [70]:
#Get keys of value.
df_transcript.value.apply(lambda x: list(x.keys())[0])
Out[70]:
0         offer id
1         offer id
2         offer id
3         offer id
4         offer id
            ...   
306529      amount
306530      amount
306531      amount
306532      amount
306533      amount
Name: value, Length: 306534, dtype: object
In [71]:
#Get values of value.
df_transcript.value.apply(lambda x: list(x.values())[0])
Out[71]:
0         9b98b8c7a33c4b65b9aebfe6a799e6d9
1         0b1e1539f2cc45b7b9fa7c272da2e1d7
2         2906b810c7d4411798c6938adc9daaa5
3         fafdcd668e3743c1bb461111dcafc2a4
4         4d5c57ea9a6940dd891ad53e9dbe8da0
                        ...               
306529                                1.59
306530                                9.53
306531                                3.61
306532                                3.53
306533                                4.05
Name: value, Length: 306534, dtype: object
In [72]:
#Add value_keys and values from value in df_transcript.
df_transcript['value_keys'] = df_transcript.value.apply(lambda x: list(x.keys())[0]) 
df_transcript['values'] = df_transcript.value.apply(lambda x: list(x.values())[0]) 
df_transcript.head()
Out[72]:
person event value time value_keys values
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0 offer id 9b98b8c7a33c4b65b9aebfe6a799e6d9
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0 offer id 0b1e1539f2cc45b7b9fa7c272da2e1d7
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0 offer id 2906b810c7d4411798c6938adc9daaa5
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0 offer id fafdcd668e3743c1bb461111dcafc2a4
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0 offer id 4d5c57ea9a6940dd891ad53e9dbe8da0
In [73]:
#Drop the value column.
df_transcript = df_transcript.drop(labels='value', axis=1)
df_transcript.head()
Out[73]:
person event time value_keys values
0 78afa995795e4d85b5d9ceeca43f5fef offer received 0 offer id 9b98b8c7a33c4b65b9aebfe6a799e6d9
1 a03223e636434f42ac4c3df47e8bac43 offer received 0 offer id 0b1e1539f2cc45b7b9fa7c272da2e1d7
2 e2127556f4f64592b11af22de27a7932 offer received 0 offer id 2906b810c7d4411798c6938adc9daaa5
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received 0 offer id fafdcd668e3743c1bb461111dcafc2a4
4 68617ca6246f4fbc85e91a2a49552598 offer received 0 offer id 4d5c57ea9a6940dd891ad53e9dbe8da0
In [74]:
#Get columns.
df_transcript.columns
Out[74]:
Index(['person', 'event', 'time', 'value_keys', 'values'], dtype='object')
In [75]:
#The shape of df_transcript.
df_transcript.shape
Out[75]:
(306534, 5)
In [76]:
#The number of each person.
df_transcript.person.value_counts()
Out[76]:
94de646f7b6041228ca7dec82adb97d2    51
8dbfa485249f409aa223a2130f40634a    49
d0a80415b84c4df4908b8403b19765e3    48
5e60c6aa3b834e44b822ea43a3efea26    48
79d9d4f86aca4bed9290350fb43817c2    48
                                    ..
fccc9279ba56411f80ffe8ce7e0935cd     2
df9fc9a86ca84ef5aedde8925d5838ba     2
cae5e211053f4121a389a7da4d631f7f     2
e63e42480aae4ede9f07cac49c8c3f78     2
da7a7c0dcfcb41a8acc7864a53cf60fb     1
Name: person, Length: 17000, dtype: int64
In [77]:
#The number of each event.
df_transcript.event.value_counts()
Out[77]:
transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
Name: event, dtype: int64
In [78]:
#Transactions as a percentage of offer-related events.
(df_transcript.event == 'transaction').sum() / (df_transcript.event.value_counts().sum() - (df_transcript.event == 'transaction').sum())*100
Out[78]:
82.91691778900949
In [79]:
#Percentage of offer events that were viewed.
x = df_transcript.event.value_counts().sum() - (df_transcript.event == 'transaction').sum()
((df_transcript.event == 'offer viewed').sum() / x ) * 100
Out[79]:
34.44602908444275
In [80]:
#Share of offer events that ended in completion (note: only x, the offer-event total, is printed below).
x = df_transcript.event.value_counts().sum() - (df_transcript.event == 'transaction').sum()
((df_transcript.event == 'offer completed').sum() / x ) * 100
print(x)
167581
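The cell above prints x, the offer-event total, but not the completion rate itself. A quick sketch of the intended percentage, using the event counts reported by value_counts() earlier in this section:

```python
# Event counts taken from the value_counts() output earlier in this section.
offer_received = 76277
offer_viewed = 57725
offer_completed = 33579

offer_events = offer_received + offer_viewed + offer_completed  # 167581
completion_pct = offer_completed / offer_events * 100
print(f"{completion_pct:.2f}%")  # roughly 20.04%
```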
In [81]:
#The number of each time.
df_transcript.time.value_counts()
Out[81]:
408    17030
576    17015
504    16822
336    16302
168    16150
       ...  
318      940
330      938
156      914
162      910
150      894
Name: time, Length: 120, dtype: int64
In [82]:
#The number of each value_keys.
df_transcript.value_keys.value_counts()
Out[82]:
amount      138953
offer id    134002
offer_id     33579
Name: value_keys, dtype: int64
In [83]:
#The number of each values.
df_transcript['values'].value_counts()
Out[83]:
fafdcd668e3743c1bb461111dcafc2a4    20241
2298d6c36e964ae4a3e7e9706d1fb8c2    20139
f19421c1d4aa40978ebb69ca19b0e20d    19131
4d5c57ea9a6940dd891ad53e9dbe8da0    18222
ae264e3637204a6fb9bb56bc8210ddfd    18062
                                    ...  
380.24                                  1
378.35                                  1
138.36                                  1
87.71                                   1
43.29                                   1
Name: values, Length: 5113, dtype: int64
In [84]:
#Counts of (event, values) pairs; offer ids and transaction amounts appear separately.
df_transcript[['event', 'values']].value_counts()
Out[84]:
event           values                          
offer received  9b98b8c7a33c4b65b9aebfe6a799e6d9    7677
                0b1e1539f2cc45b7b9fa7c272da2e1d7    7668
                ae264e3637204a6fb9bb56bc8210ddfd    7658
                2298d6c36e964ae4a3e7e9706d1fb8c2    7646
                2906b810c7d4411798c6938adc9daaa5    7632
                                                    ... 
transaction     49.2                                   1
                49.23                                  1
                49.38                                  1
                49.48                                  1
                1062.28                                1
Length: 5131, dtype: int64
In [85]:
df_transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   person      306534 non-null  object
 1   event       306534 non-null  object
 2   time        306534 non-null  int64 
 3   value_keys  306534 non-null  object
 4   values      306534 non-null  object
dtypes: int64(1), object(4)
memory usage: 11.7+ MB
In [86]:
#Check for missing data.
df_transcript.isnull().value_counts()
Out[86]:
person  event  time   value_keys  values
False   False  False  False       False     306534
dtype: int64
In [87]:
#Create transcript_event: rows where event is 'transaction'.
transcript_event = df_transcript[df_transcript['event'] == 'transaction']
transcript_event.head()
Out[87]:
person event time value_keys values
12654 02c083884c7d45b39cc68e1314fec56c transaction 0 amount 0.83
12657 9fa9ae8f57894cc9a3b8a9bbe0fc1b2f transaction 0 amount 34.56
12659 54890f68699049c2a04d415abc25e717 transaction 0 amount 13.23
12670 b2f1cd155b864803ad8334cdf13c4bd2 transaction 0 amount 19.51
12671 fe97aa22dd3e48c8b143116a8403dd52 transaction 0 amount 18.97
In [88]:
transcript_event.shape
Out[88]:
(138953, 5)
In [89]:
#Create transcript_value_keys: rows where value_keys is 'amount'.
transcript_value_keys = df_transcript[df_transcript['value_keys'] == 'amount']
transcript_value_keys.head()
Out[89]:
person event time value_keys values
12654 02c083884c7d45b39cc68e1314fec56c transaction 0 amount 0.83
12657 9fa9ae8f57894cc9a3b8a9bbe0fc1b2f transaction 0 amount 34.56
12659 54890f68699049c2a04d415abc25e717 transaction 0 amount 13.23
12670 b2f1cd155b864803ad8334cdf13c4bd2 transaction 0 amount 19.51
12671 fe97aa22dd3e48c8b143116a8403dd52 transaction 0 amount 18.97
In [90]:
transcript_value_keys.shape
Out[90]:
(138953, 5)

transcript_event and transcript_value_keys have the same shape, indicating that every 'transaction' event carries an 'amount' key, and vice versa.
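That relationship can be verified directly: because every 'transaction' row stores an 'amount' key, the two filters select the same rows. A minimal sketch on toy data (hypothetical rows, same columns as above):

```python
import pandas as pd

# Toy transcript (hypothetical rows) with the columns used above.
df = pd.DataFrame({
    "event": ["transaction", "offer received", "transaction"],
    "value_keys": ["amount", "offer id", "amount"],
})

by_event = df[df["event"] == "transaction"]
by_key = df[df["value_keys"] == "amount"]

# The two filters select exactly the same rows.
print(by_event.equals(by_key))  # True
```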

Data Visualization of Transcript.¶

In [91]:
# The number of each event.
plt.figure(figsize=(10,5))
df_transcript.event.value_counts().plot(kind='bar');
plt.ylabel('Sum of event.', fontsize = 12)
plt.xlabel('Event.', fontsize = 12)
plt.title('Amount of event', fontsize = 15);
In [92]:
# Check df_portfolio (all 10 rows).
df_portfolio
Out[92]:
reward channels difficulty duration offer_type id email web social mobile
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0 1 1 1 1
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed 1 1 0 1
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 1 0 1
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 1 0 0
5 3 [web, email, mobile, social] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2 1 1 1 1
6 2 [web, email, mobile, social] 10 10 discount fafdcd668e3743c1bb461111dcafc2a4 1 1 1 1
7 0 [email, mobile, social] 0 3 informational 5a8bc65990b245e5a138643cd4eb9837 1 0 1 1
8 5 [web, email, mobile, social] 5 5 bogo f19421c1d4aa40978ebb69ca19b0e20d 1 1 1 1
9 2 [web, email, mobile] 10 7 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
In [93]:
# Check the first 5 lines of df_profile.
df_profile.head()
Out[93]:
gender age id income became_member_on year month days val range_of_age
10282 M 46 e4989ca3d8974e28a3fe87aa0b0ce327 74000.0 2018-07-26 2018 7 26 1 4
14757 M 59 3d50b3815c8e4f3b84f46be4d1b64cd3 34000.0 2018-07-26 2018 7 26 1 5
12902 M 85 f1f01be4e2344bf08b7f488c115ad374 45000.0 2018-07-26 2018 7 26 1 8
4357 M 76 6597e5ab63754726b83ba2032f008d26 92000.0 2018-07-26 2018 7 26 1 7
13249 F 45 42f00b6b150e45149c50db9bd642bca6 70000.0 2018-07-26 2018 7 26 1 4
In [94]:
# Check the first 5 lines of df_transcript.
df_transcript.head()
Out[94]:
person event time value_keys values
0 78afa995795e4d85b5d9ceeca43f5fef offer received 0 offer id 9b98b8c7a33c4b65b9aebfe6a799e6d9
1 a03223e636434f42ac4c3df47e8bac43 offer received 0 offer id 0b1e1539f2cc45b7b9fa7c272da2e1d7
2 e2127556f4f64592b11af22de27a7932 offer received 0 offer id 2906b810c7d4411798c6938adc9daaa5
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received 0 offer id fafdcd668e3743c1bb461111dcafc2a4
4 68617ca6246f4fbc85e91a2a49552598 offer received 0 offer id 4d5c57ea9a6940dd891ad53e9dbe8da0

Shape of each dataset.

In [95]:
print(df_portfolio.shape)
print(df_profile.shape)
print(df_transcript.shape)
(10, 10)
(14613, 10)
(306534, 5)

Merge the DataFrames as follows:¶

  • id of df_portfolio onto values of df_transcript
  • id of df_profile onto person of df_transcript
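The merges below use how='right', which keeps every row of the right frame. A sketch of those semantics on toy frames (the ids here are hypothetical):

```python
import pandas as pd

# Toy frames with hypothetical ids, illustrating how='right':
# every row of the right frame (the profile) is kept.
transcript = pd.DataFrame({"person": ["a", "a", "b"], "time": [0, 6, 0]})
profile = pd.DataFrame({"id": ["a", "c"], "age": [46, 59]})

merged = transcript.merge(profile, left_on="person", right_on="id", how="right")
print(merged)
# person 'b' (no profile) is dropped; id 'c' (no events) is kept with NaNs.
```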
In [96]:
# Merge df_transcript with df_profile.
df_trans_prof = df_transcript.merge(df_profile, right_on='id', how='right', left_on='person')
df_trans_prof = df_trans_prof.drop(columns=['month', 'days', 'id'])
df_trans_prof.head()
Out[96]:
person event time value_keys values gender age income became_member_on year val range_of_age
0 e4989ca3d8974e28a3fe87aa0b0ce327 offer received 0 offer id 3f207df678b143eea3cee63160fa8bed M 46 74000.0 2018-07-26 2018 1 4
1 e4989ca3d8974e28a3fe87aa0b0ce327 offer viewed 6 offer id 3f207df678b143eea3cee63160fa8bed M 46 74000.0 2018-07-26 2018 1 4
2 e4989ca3d8974e28a3fe87aa0b0ce327 transaction 78 amount 11.0 M 46 74000.0 2018-07-26 2018 1 4
3 e4989ca3d8974e28a3fe87aa0b0ce327 transaction 102 amount 29.2 M 46 74000.0 2018-07-26 2018 1 4
4 e4989ca3d8974e28a3fe87aa0b0ce327 offer received 168 offer id 0b1e1539f2cc45b7b9fa7c272da2e1d7 M 46 74000.0 2018-07-26 2018 1 4
In [97]:
# Events carrying an amount (all of them are transactions).
df_trans_prof[df_trans_prof['value_keys'] == 'amount']['event'].value_counts()
Out[97]:
transaction    122176
Name: event, dtype: int64
In [98]:
#Transactions whose value is not an amount (expected to be empty).
df_trans_prof[(df_trans_prof['event'] == 'transaction') & (df_trans_prof['value_keys'] != 'amount')]['event'].value_counts()
Out[98]:
Series([], Name: event, dtype: int64)
In [99]:
#Offer-related events (the value is an offer id).
df_trans_prof[df_trans_prof['value_keys'] != 'amount']['event'].value_counts()
Out[99]:
offer received     65585
offer viewed       49087
offer completed    31943
Name: event, dtype: int64
In [100]:
# Frequency of use of services by range_of_age.
df_trans_prof.range_of_age.value_counts()
Out[100]:
5     62749
6     53146
4     41713
7     31037
3     29794
2     26669
8     14991
9      4417
1      3962
10      313
Name: range_of_age, dtype: int64
In [101]:
# Sums of age and income by event and gender.
df_trans_prof.groupby(['event','gender'])[['age','income']].sum()
Out[101]:
age income
event gender
offer completed F 897242 1.134670e+09
M 886192 1.085293e+09
offer received F 1577891 1.955927e+09
M 1987588 2.333494e+09
offer viewed F 1194201 1.475803e+09
M 1487631 1.756511e+09
transaction F 2747515 3.299407e+09
M 3692523 4.257817e+09
In [102]:
# Check for null income values in df_trans_prof.
df_trans_prof[df_trans_prof.income.isnull()].value_counts()
Out[102]:
Series([], dtype: int64)

Next, link each event to the offer the person received.

In [103]:
# Merge df_transcript with df_portfolio.
df_trans_sum = df_transcript.merge(df_portfolio, right_on='id', how='right', left_on='values')
df_trans_sum = df_trans_sum.drop(columns=['channels', 'values'])
df_trans_sum.head()
Out[103]:
person event time value_keys reward difficulty duration offer_type id email web social mobile
0 4b0da7e80e5945209a1fdddfe813dbe0 offer received 0 offer id 10 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
1 1e9420836d554513ab90eba98552d0a9 offer received 0 offer id 10 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
2 02c083884c7d45b39cc68e1314fec56c offer received 0 offer id 10 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
3 676506bad68e4161b9bbaffeb039626b offer received 0 offer id 10 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
4 fe8264108d5b4f198453bbb1fa7ca6c9 offer received 0 offer id 10 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
In [104]:
# Popularity of offer types.
df_trans_sum.offer_type.value_counts()
Out[104]:
bogo             71617
discount         69898
informational    26066
Name: offer_type, dtype: int64

Offer ids by offer_type, with event counts and delivery channels.¶

In [105]:
df_trans_sum.groupby(['offer_type','id'])[['event']].count()
Out[105]:
event
offer_type id
bogo 4d5c57ea9a6940dd891ad53e9dbe8da0 18222
9b98b8c7a33c4b65b9aebfe6a799e6d9 16202
ae264e3637204a6fb9bb56bc8210ddfd 18062
f19421c1d4aa40978ebb69ca19b0e20d 19131
discount 0b1e1539f2cc45b7b9fa7c272da2e1d7 13751
2298d6c36e964ae4a3e7e9706d1fb8c2 20139
2906b810c7d4411798c6938adc9daaa5 15767
fafdcd668e3743c1bb461111dcafc2a4 20241
informational 3f207df678b143eea3cee63160fa8bed 11761
5a8bc65990b245e5a138643cd4eb9837 14305
In [106]:
df_trans_sum.groupby(['offer_type','id'])[['email','mobile','web','social']].sum()
Out[106]:
email mobile web social
offer_type id
bogo 4d5c57ea9a6940dd891ad53e9dbe8da0 18222 18222 18222 18222
9b98b8c7a33c4b65b9aebfe6a799e6d9 16202 16202 16202 0
ae264e3637204a6fb9bb56bc8210ddfd 18062 18062 0 18062
f19421c1d4aa40978ebb69ca19b0e20d 19131 19131 19131 19131
discount 0b1e1539f2cc45b7b9fa7c272da2e1d7 13751 0 13751 0
2298d6c36e964ae4a3e7e9706d1fb8c2 20139 20139 20139 20139
2906b810c7d4411798c6938adc9daaa5 15767 15767 15767 0
fafdcd668e3743c1bb461111dcafc2a4 20241 20241 20241 20241
informational 3f207df678b143eea3cee63160fa8bed 11761 11761 11761 0
5a8bc65990b245e5a138643cd4eb9837 14305 14305 0 14305

Modeling¶

Because we are running promotions and want to learn which offers customers respond to, the model targets completed offers ("offer completed") and predicts which type of offer was received. The target is limited to "BOGO" and "DISCOUNT", encoded as follows:

  • BOGO is 1
  • DISCOUNT is 0

Modeling uses the following algorithms from the sklearn library:

  1. LinearRegression
  2. LogisticRegression
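A hedged illustration of how the two models differ on a 0/1 target, on a tiny toy problem (not the real dataset): linear regression returns real-valued scores, while logistic regression returns class labels directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Tiny separable toy problem: the first feature determines the class,
# standing in for the dummy-encoded offer features.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]] * 5)
y = X[:, 0]

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

print(lin.predict(X[:4]))  # real values near 1, 1, 0, 0 (with float noise)
print(log.predict(X[:4]))  # exact class labels
```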
In [107]:
#Merge with df_portfolio and keep only the columns of interest.
#(Note: completed offers use the key 'offer_id', so rows are filtered by event below rather than by value_keys.)
df_trans_prof_offer = df_trans_prof.merge(df_portfolio, right_on='id', how='right', left_on='values')
df_trans_prof_offer = df_trans_prof_offer.drop(columns=['channels', 'values', 'became_member_on', 'val',
                                                        'time', 'income', 'age', 'difficulty',
                                                        'duration', 'gender', 'reward', 'person'])
df_trans_prof_offer = df_trans_prof_offer[df_trans_prof_offer.event == 'offer completed']
df_trans_prof_offer
Out[107]:
event value_keys year range_of_age offer_type id email web social mobile
2 offer completed offer_id 2018 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
15 offer completed offer_id 2018 2 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
17 offer completed offer_id 2018 5 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
20 offer completed offer_id 2018 6 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
24 offer completed offer_id 2018 5 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
... ... ... ... ... ... ... ... ... ... ...
146603 offer completed offer_id 2013 3 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146607 offer completed offer_id 2013 6 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146609 offer completed offer_id 2013 2 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146612 offer completed offer_id 2013 6 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146614 offer completed offer_id 2013 5 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1

31943 rows × 10 columns

In [108]:
#Check offer_type counts among completed offers.
df_trans_prof_offer['offer_type'].value_counts()
Out[108]:
discount    16930
bogo        15013
Name: offer_type, dtype: int64
In [109]:
df_trans_prof_offer.shape[0]
Out[109]:
31943
In [110]:
#Create an "offer_type_code" column: 1 for bogo, 0 for discount.
df_trans_prof_offer['offer_type_code'] = (df_trans_prof_offer['offer_type'] == 'bogo').astype(int)
In [111]:
df_trans_prof_offer
Out[111]:
event value_keys year range_of_age offer_type id email web social mobile offer_type_code
2 offer completed offer_id 2018 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1 1
15 offer completed offer_id 2018 2 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1 1
17 offer completed offer_id 2018 5 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1 1
20 offer completed offer_id 2018 6 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1 1
24 offer completed offer_id 2018 5 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1 1
... ... ... ... ... ... ... ... ... ... ... ...
146603 offer completed offer_id 2013 3 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1 0
146607 offer completed offer_id 2013 6 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1 0
146609 offer completed offer_id 2013 2 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1 0
146612 offer completed offer_id 2013 6 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1 0
146614 offer completed offer_id 2013 5 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1 0

31943 rows × 11 columns

In [112]:
df_process = df_trans_prof_offer.iloc[:, :10]
df_process
Out[112]:
event value_keys year range_of_age offer_type id email web social mobile
2 offer completed offer_id 2018 7 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
15 offer completed offer_id 2018 2 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
17 offer completed offer_id 2018 5 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
20 offer completed offer_id 2018 6 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
24 offer completed offer_id 2018 5 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 0 1 1
... ... ... ... ... ... ... ... ... ... ...
146603 offer completed offer_id 2013 3 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146607 offer completed offer_id 2013 6 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146609 offer completed offer_id 2013 2 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146612 offer completed offer_id 2013 6 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1
146614 offer completed offer_id 2013 5 discount 2906b810c7d4411798c6938adc9daaa5 1 1 0 1

31943 rows × 10 columns

In [113]:
#Columns to one-hot encode.
cat_cols = df_process.columns
cat_cols
Out[113]:
Index(['event', 'value_keys', 'year', 'range_of_age', 'offer_type', 'id',
       'email', 'web', 'social', 'mobile'],
      dtype='object')
In [114]:
for col in  cat_cols:
    df_process = pd.concat([df_process.drop(col, axis=1), pd.get_dummies(df_process[col], 
                                                                         prefix=col, 
                                                                         prefix_sep='_', 
                                                                         drop_first=True, 
                                                                         dummy_na=False)], axis=1)
In [115]:
df_process.head()
Out[115]:
year_2014 year_2015 year_2016 year_2017 year_2018 range_of_age_2 range_of_age_3 range_of_age_4 range_of_age_5 range_of_age_6 ... id_2298d6c36e964ae4a3e7e9706d1fb8c2 id_2906b810c7d4411798c6938adc9daaa5 id_4d5c57ea9a6940dd891ad53e9dbe8da0 id_9b98b8c7a33c4b65b9aebfe6a799e6d9 id_ae264e3637204a6fb9bb56bc8210ddfd id_f19421c1d4aa40978ebb69ca19b0e20d id_fafdcd668e3743c1bb461111dcafc2a4 web_1 social_1 mobile_1
2 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 1
15 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 1
17 0 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 1 0 0 0 1 1
20 0 0 0 0 1 0 0 0 0 1 ... 0 0 0 0 1 0 0 0 1 1
24 0 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 1 0 0 0 1 1

5 rows × 25 columns

In [116]:
df_process.columns
Out[116]:
Index(['year_2014', 'year_2015', 'year_2016', 'year_2017', 'year_2018',
       'range_of_age_2', 'range_of_age_3', 'range_of_age_4', 'range_of_age_5',
       'range_of_age_6', 'range_of_age_7', 'range_of_age_8', 'range_of_age_9',
       'range_of_age_10', 'offer_type_discount',
       'id_2298d6c36e964ae4a3e7e9706d1fb8c2',
       'id_2906b810c7d4411798c6938adc9daaa5',
       'id_4d5c57ea9a6940dd891ad53e9dbe8da0',
       'id_9b98b8c7a33c4b65b9aebfe6a799e6d9',
       'id_ae264e3637204a6fb9bb56bc8210ddfd',
       'id_f19421c1d4aa40978ebb69ca19b0e20d',
       'id_fafdcd668e3743c1bb461111dcafc2a4', 'web_1', 'social_1', 'mobile_1'],
      dtype='object')
In [117]:
# Features.
X = df_process

# Response variable.
y = df_trans_prof_offer['offer_type_code']

# Split into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

1. LinearRegression¶

In [118]:
#LinearRegression (note: normalize=True is deprecated in newer scikit-learn versions).
model_linear = LinearRegression(normalize=True)

#Fit
model_linear.fit(X_train, y_train)

#Predict.
y_test_preds = model_linear.predict(X_test)
y_train_preds = model_linear.predict(X_train)

#Accuracy (R^2) of the LinearRegression model, scored against the true labels.
acc = model_linear.score(X_test, y_test)
print('Test Accuracy: %.2f' % (acc*100))
acc = model_linear.score(X_train, y_train)
print('Train Accuracy: %.2f' % (acc*100))

#Loss of the LinearRegression model.
test_score = mean_squared_error(y_test, y_test_preds)
train_score = mean_squared_error(y_train, y_train_preds)
print('test mean_squared_error: ', test_score)
print('train mean_squared_error: ', train_score)

#r2_score of the LinearRegression model.
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)
print('test r2_score: ', test_score)
print('train r2_score: ', train_score)

#Compare data.
print('\ny_test[:10]:', list(y_test[:10]))
print('y_train[:10]:', list(y_train[:10]))
print('y_test_preds_linear[:10]:', y_test_preds[:10])
print('y_train_preds_linear[:10]:', y_train_preds[:10])
Test Accuracy: 100.00
Train Accuracy: 100.00
test mean_squared_error:  5.150667009329443e-31
train mean_squared_error:  5.184103150848712e-31
test r2_score:  1.0
train r2_score:  1.0

y_test[:10]: [1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
y_train[:10]: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_test_preds_linear[:10]: [ 1.00000000e+00 -1.11022302e-16  1.00000000e+00  1.00000000e+00
  1.00000000e+00 -5.55111512e-16  4.99600361e-16 -3.33066907e-16
  1.11022302e-16  1.00000000e+00]
y_train_preds_linear[:10]: [ 1.11022302e-16 -5.55111512e-16 -6.66133815e-16  1.00000000e+00
  1.00000000e+00  1.00000000e+00  1.00000000e+00  1.00000000e+00
  1.00000000e+00  1.00000000e+00]

2. LogisticRegression¶

In [119]:
#LogisticRegression
model_logis = LogisticRegression()

#Fit
model_logis.fit(X_train, y_train)

#Predict.
y_test_preds = model_logis.predict(X_test)
y_train_preds = model_logis.predict(X_train)

#Accuracy of the LogisticRegression model.
acc = accuracy_score(y_test, y_test_preds)
print('Test Accuracy: %.2f' % (acc*100))
acc = accuracy_score(y_train, y_train_preds)
print('Train Accuracy: %.2f' % (acc*100))

#mean_squared_error of the LogisticRegression model.
test_score = mean_squared_error(y_test, y_test_preds)
train_score = mean_squared_error(y_train, y_train_preds)
print('test mean_squared_error: ', test_score)
print('train mean_squared_error: ', train_score)

#r2_score of the LogisticRegression model.
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)
print('test r2_score: ', test_score)
print('train r2_score: ', train_score)

#Compare data.
print('\ny_test[:10]:', list(y_test[:10]))
print('y_train[:10]:', list(y_train[:10]))
print('y_test_preds_logis[:10]:', y_test_preds[:10])
print('y_train_preds_logis[:10]:', y_train_preds[:10])

#Confusion Matrix.
print('\nConfusion Matrix')
conf_matrix = confusion_matrix(y_test, y_test_preds)
print(conf_matrix)

accuracy = (conf_matrix[0][0] + conf_matrix[-1][-1]) / conf_matrix.sum()
print('\nAccuracy is', accuracy*100)

#Plot the confusion matrix (plot_confusion_matrix comes from sklearn, not matplotlib).
disp = plot_confusion_matrix(model_logis, X_test, y_test, cmap=plt.cm.Blues);
print('\n', disp.confusion_matrix)

print(classification_report(y_test, y_test_preds))
Test Accuracy: 100.00
Train Accuracy: 100.00
test mean_squared_error:  0.0
train mean_squared_error:  0.0
test r2_score:  1.0
train r2_score:  1.0

y_test[:10]: [1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
y_train[:10]: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_test_preds_logis[:10]: [1 0 1 1 1 0 0 0 0 1]
y_train_preds_logis[:10] [0 0 0 1 1 1 1 1 1 1]

Confusion Matrix
[[5051    0]
 [   0 4532]]

Accuracy is 100.0

 [[5051    0]
 [   0 4532]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5051
           1       1.00      1.00      1.00      4532

    accuracy                           1.00      9583
   macro avg       1.00      1.00      1.00      9583
weighted avg       1.00      1.00      1.00      9583

Results¶

According to the modeling, LinearRegression and LogisticRegression both report 100% accuracy but differ in MSE and in their raw predictions. The LinearRegression model leaves a tiny non-zero error because its predictions are real numbers very close to, but not exactly, 0 and 1, while LogisticRegression predicts the class labels exactly, with 100% precision and an MSE of 0. Both models score perfectly largely because the dummy-encoded features include the offer_type and offer id columns, which determine the target directly.
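The LinearRegression model's "non-zero error" (~1e-31) is floating-point noise rather than genuine misclassification; thresholding its outputs at 0.5 recovers the same labels. A sketch using values of the magnitude printed in the LinearRegression output above:

```python
import numpy as np

# Predictions of the kind shown in the LinearRegression output:
# values within ~1e-16 of the true 0/1 labels.
preds = np.array([1.0, -1.11e-16, 1.0, 4.99e-16])
labels = (preds > 0.5).astype(int)
print(labels)  # [1 0 1 0]
```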

Conclusion¶

The data contains non-identifiable or implausible records, such as missing gender and income and exaggerated ages, which would distort the analysis, so I removed those unclear entries. Consumers span a wide range of ages, so to distinguish groups more clearly I divided age into ranges and predicted offer acceptance by age range. The preferred presentation channels have also shifted over the years.

1. Portfolio:

Specifies the details of the offers presented in various formats. The "BOGO" format has the most offers; the most used channel is email and the least used is social. I think the presentation channels should follow consumers' actual technology habits as closely as possible.

2. Profile:

Describes the consumer history. According to the preliminary analysis, new user registrations peaked in 2017; users start at age 18, and the service is most popular with users between the ages of 50 and 60. The proportions of men and women are similar, although in 2013 male consumers outnumbered female ones. However, some users cannot be identified because they did not register, so their gender and income are unknown.

3. Transcript:

Describes the consumers' use of the service. Transactions are numerous, amounting to 82.92% of the volume of offer-related events, and many of these purchases occur without the customer having received or viewed an offer; the product is clearly in demand by the general public. Looking at responses to the 167,581 offer events, 34.45% were views and only about 20% were completions.

The modeling compares two models from the Scikit-learn library, LinearRegression and LogisticRegression, using accuracy, MSE, and R² as indicators, to predict consumer demand for offers from "offer completed" events with year, age range, offer channels, and offer format as variables. Only the "BOGO" and "DISCOUNT" offers are considered.

1. LinearRegression model: the accuracy and R² are both at their maximum, but the MSE is not exactly zero. Comparing the predictions with the true labels shows values that deviate slightly from 0 and 1.

2. LogisticRegression model: the accuracy is 100%, the R² is 1.0, and the MSE is 0. The predicted labels match the true labels exactly.

This means the LogisticRegression model is more appropriate for this data set, even though LinearRegression also reached 100% accuracy. Note, however, that unclear records were removed before modeling, and the model does not include individual customer attributes. Even so, this counts as a successful first model, because its results are consistent with the existing data.

Improvements¶

According to the modeling, we can tell which "BOGO" and "DISCOUNT" offers people are interested in by age range, year, and promotion channel, which should let us reach more consumers. Richer information on customer needs should increase the likelihood of a second response to a promotion and bring customers back more often; ideally, products would be recommended to individual consumers based on their preferences.

Only two models were compared, so SVM, Naive Bayes classification, K-Nearest Neighbors (K-NN), and similar classifiers would be the most natural extensions of this analysis.
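As a starting point for those extensions, other scikit-learn classifiers can be swapped in with minimal code changes. A sketch on toy data (the feature matrix and target here are hypothetical stand-ins for the dummy-encoded X and y above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Toy binary feature matrix standing in for the dummy-encoded X;
# the hypothetical target depends on one feature, as in this notebook.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0]

# Fit each candidate model and record its training accuracy.
scores = {}
for model in (KNeighborsClassifier(), BernoulliNB(), SVC()):
    scores[type(model).__name__] = model.fit(X, y).score(X, y)
print(scores)
```

In a real comparison, each model would of course be scored on the held-out X_test/y_test split rather than on the training data.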

In [ ]: